Abstract

Many approaches that transform classification problems from non-linear
to linear by means of a feature transformation have recently been
presented in the literature. These notably include sparse coding methods
and deep neural networks. However, many of these approaches require the
repeated application of a learning process whenever unseen input vectors
are presented, or else involve large numbers of parameters and
hyper-parameters, which must be chosen through cross-validation, thus
increasing running time dramatically. In this paper, we propose and
experimentally investigate a new approach that overcomes limitations of
both kinds. The proposed approach makes use of a linear auto-associative
network (called SCNN) with just one hidden layer. The combination of
this architecture with a specific error function to be minimized enables
one to learn a linear encoder computing a sparse code which turns out to
be as similar as possible to the sparse coding that one obtains by
re-training the neural network. Importantly, the linearity of SCNN and
the choice of the error function allow one to achieve reduced running
time in the learning phase. The proposed architecture is evaluated on
two standard machine learning tasks. Its performance is compared with
that of recently proposed non-linear auto-associative neural networks.
The overall results suggest that linear encoders can be profitably used
to obtain sparse data representations in the context of machine learning
problems, provided that an appropriate error function is used during the
learning phase.


1 Introduction 

Various approaches that transform classification problems from
non-linear to linear by means of a feature transformation have recently
been investigated. These notably include sparse coding methods and deep
neural networks. However, two remarkable limitations usually affect
these approaches: (i) most current sparse coding methods require the
repeated application of some learning process in order to compute a
sparse representation of the input whenever the algorithm is fed with an
input vector that was never used during the training phase; (ii) even
though auto-associative neural networks enable one to overcome
limitation (i), many of them, such as deep belief networks, present
several levels of complexity or non-linearity. In fact, their learning
methods usually involve a large number of parameters and
hyper-parameters, such as the number of hidden units for more than one
hidden layer, learning rates, momentum, weight decay, and so on. These
parameters must be chosen through cross-validation, thus leading to a
dramatic increase of running time. Interestingly, some authors suggest
that simpler architectures, requiring a reduced number of parameters to
be found, may enable one to achieve state-of-the-art performance.
Following up this suggestion, the use of a relatively “simple” neural
network approach, overcoming limitations (i) and (ii), is proposed here
and experimentally investigated in terms of a trade-off between
computational cost and performance.

The problem of overcoming limitations (i) and (ii) was addressed (a) by 
selecting an auto-associative neural network which enables one to learn a 
mapping between data and code space (the encoder) during the learning 
phase so that the learned mapping can be subsequently used on unseen data; 
and (b) by choosing just one linear hidden layer with identity as output 
function, jointly with an error function which allows one to learn at the same 
time both the encoder and the decoder by taking explicitly into account 
the contributions given by the two mappings. Given the linearity of the
hidden layer and the selected error function, “good” learning rate
parameters, ensuring linear rates of convergence in the minimization of
the error function, can be chosen in an explicit form without using
cross-validation.

Points (a) and (b) guarantee that a reduced number of parameters must 
be determined during the learning phase: these are the number of hidden 
nodes and the sparsity parameter value which controls the extent to which 
the coding is sparse. On account of this fact, the present approach turns out 
to be computationally less expensive than deep neural networks and standard 
sparse coding approaches alike. Importantly, the present approach enables 
one to compute a linear encoder by means of which, during the test phase 
(on unseen data), one obtains a sparse code which is as similar as possible 
to the sparse code that one obtains by re-training the neural network. By 
applying a non-linear operation such as soft-thresholding or soft-max on this 
code one obtains a non-linear feature transformation. The rest of the
paper is organized as follows. Previously proposed approaches and their
relation to ours are discussed in Section 2, and the selected
auto-associative neural network is presented in Section 3. The
experiments and their outcomes are presented in Section 4. In
particular, tests showing the capability of our approach to reproduce
PCA behaviour are described and discussed in Section 4.1, whereas the
ability to obtain appropriate sparse data representations enabling one
to solve two standard machine learning tasks is evaluated in Sections
4.2, 4.3.1 and 4.3.2. Our approach is additionally compared with two
auto-associative neural networks recently proposed in the literature and
with current results presented in the literature. Finally, Section 5 is
devoted to an analysis of the experimental results and their
significance.

2 Background and related work 

In the context of machine learning, the problem that we are considering here 
is usually expressed in terms of a minimization problem as follows: 

$$\min_{U,D} \; \|X - UD^T\|_F^2 + \lambda\,\Omega(U) \quad \text{subject to } \|d_j\|_2 \le 1, \; j = 1, \dots, m \qquad (1)$$

where X is an N × p matrix containing the N p-dimensional signals to
be represented, D is a p × m matrix containing the basis vectors (or
atoms) arranged column-wise (d_j denotes its j-th column), U is an N × m
matrix containing the sparse representations of the signals in X, Ω(U)
is a norm or quasi-norm regularizing the solutions of the minimization
problem, and the parameter λ > 0 controls to what extent the
representations are regularized. The regularization term penalizes
solutions containing many non-zero coefficients.

Importantly, approaches of this kind give rise to a crucial limitation:
after the learning phase, when an unseen input vector x is considered, a
minimization process is again required to compute its sparse
representation u. Sparse coding with auto-associative neural networks
overcomes this limitation, and several approaches based on this
observation have accordingly been proposed.
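The limitation described above can be made concrete with a minimal sketch. It assumes that Ω is the L1 norm and that the per-signal problem is solved with ISTA (iterative soft-thresholding); both choices, as well as the names `sparse_code_ista` and `soft_threshold`, are illustrative assumptions and are not taken from the approaches cited here. The point is only that, with a fixed dictionary D, every unseen x still needs its own iterative minimization.

```python
import numpy as np

def soft_threshold(v, t):
    """Element-wise soft-thresholding: sign(v) * max(|v| - t, 0)."""
    return np.sign(v) * np.maximum(np.abs(v) - t, 0.0)

def sparse_code_ista(x, D, lam, n_iter=200):
    """Approximately minimise ||x - D u||_2^2 + lam * ||u||_1 over u.

    x: (p,) unseen signal; D: (p, m) fixed dictionary; lam: assumed L1 weight.
    """
    # Lipschitz constant of the gradient of the quadratic term.
    L = 2.0 * np.linalg.norm(D, 2) ** 2
    u = np.zeros(D.shape[1])
    for _ in range(n_iter):
        grad = 2.0 * D.T @ (D @ u - x)
        u = soft_threshold(u - grad / L, lam / L)
    return u

# Hypothetical data: a random dictionary with unit-norm atoms.
rng = np.random.default_rng(0)
D = rng.standard_normal((16, 32))
D /= np.linalg.norm(D, axis=0)
u_true = np.zeros(32)
u_true[[3, 17]] = [1.5, -2.0]
x = D @ u_true
u = sparse_code_ista(x, D, lam=0.1)
```

An encoder learned once during training replaces this loop with a single forward pass, which is the motivation for the auto-associative approaches discussed next.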

In a nutshell, these approaches are based on an encoder-decoder
architecture. The input x is fed into the encoder, which produces a
feature vector u, i.e., a code of x. In turn, the code u is fed as input
into the decoder module, which reconstructs the input x from the code.
Both encoder and decoder are feed-forward neural networks which may
present several degrees of non-linearity. The encoder and the decoder
are trained so as to minimize the error between the input x and the
reconstructed input, with the proviso that the code u must satisfy
certain given constraints in order to obtain a sparse code of x.
Sometimes a further term is added in (1) so as to make the output of the
encoder as similar as possible to the code u. Such a term is added in
our approach as well (see Section 3). Importantly, as mentioned in the
previous section, some authors identified, as a drawback of many such
architectures, their considerable complexity and computational cost.
Notably, the learning processes involved in these approaches usually
require a large number of hyper-parameters such as learning rates,
momentum, weight decay, and so on, that must be chosen through
cross-validation, thus increasing running times dramatically.
Furthermore, the same authors suggest that it is possible to achieve
state-of-the-art performance by means of simpler architectures requiring
the identification of a reduced number of parameters.

Recently proposed architectures, which we now turn to examine, involve
non-linear auto-associative neural networks with a single hidden layer,
called ASCNN and SAANN. Both networks, differently from our approach,
can be used with a non-linear activation function φ (e.g., sigmoid for
SAANN and tanh for ASCNN) on the hidden layer units.

In SAANN the error function expressed in (2) is used. This function
involves two terms. The first term is the standard reconstruction error
between the input signals X and the reconstructed signals UD^T with
U = φ(XC^T), where C (a.k.a. projection dictionary) is the weight matrix
of the first weight layer of the network and D (a.k.a. reconstruction
dictionary) is the weight matrix of the second weight layer of the
network. The second term is a regularization term which imposes sparse
codes of the input. As one may readily note from (2), the second term is
a function of the hidden units’ output. Moreover, the optimization
algorithm simultaneously finds the reconstruction dictionary D and the
projection dictionary C by a standard gradient descent technique.

$$E(D, C) = \frac{1}{2}\|X - \varphi(XC^T)D^T\|_F^2 + \lambda \sum_{n} \sum_{i} \log(1 + u_{ni}^2) \qquad (2)$$
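As an illustration of how the two terms of (2) are evaluated, the following sketch computes the SAANN error for given dictionaries. The function names and the random data are hypothetical; the sigmoid activation follows the example given in the text.

```python
import numpy as np

def sigmoid(a):
    """Logistic activation, used here as the hidden-unit function phi."""
    return 1.0 / (1.0 + np.exp(-a))

def saann_error(X, C, D, lam, phi=sigmoid):
    """Evaluate 0.5*||X - phi(X C^T) D^T||_F^2 + lam * sum(log(1 + u^2))."""
    U = phi(X @ C.T)                         # (N, m) hidden-unit outputs
    reconstruction = 0.5 * np.sum((X - U @ D.T) ** 2)
    sparsity = lam * np.sum(np.log1p(U ** 2))
    return reconstruction + sparsity

# Hypothetical sizes: N signals of dimension p, m hidden units.
rng = np.random.default_rng(0)
N, p, m = 5, 8, 4
X = rng.standard_normal((N, p))
C = rng.standard_normal((m, p))              # projection dictionary
D = rng.standard_normal((p, m))              # reconstruction dictionary
e0 = saann_error(X, C, D, lam=0.0)           # reconstruction term only
e1 = saann_error(X, C, D, lam=0.5)           # with the sparsity penalty
```

Because the penalty depends on U = φ(XC^T), its gradient flows through both C and the activation, which is why both dictionaries can be updated simultaneously by gradient descent.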

In ASCNN the problems of finding the reconstruction and the projection
dictionaries are addressed separately by dividing the autoassociative
network into two sub-networks: a top network and a bottom network. The
optimization algorithm involves two basic steps:

1. Only the bottom network is considered. The input to the hidden
units Z together with the reconstruction dictionary D are obtained by
minimizing the following error function:

$$E(Z, D) = \frac{1}{2}\|X - \varphi(Z)D^T\|_F^2 + \lambda \sum_{n} \sum_{i} f(z_{ni}) \qquad (3)$$

An alternating optimization technique is used to find Z and D. The
first term in equation (3) is the usual reconstruction error; the second
term is a regularization term which imposes sparse solutions for the
input codes. The sparseness function f(·) is chosen on the basis of
the activation function φ of the hidden units. A possible choice is
f(a) = log(1 + a²) when the activation function corresponds either to
tanh or to a linear function. λ is a positive value which controls the
relevance of the regularization term.

2. Only the top network is considered. The projection dictionary C is
retrieved by minimizing the following error function with respect to C:

$$E(C) = \frac{1}{2}\|XC^T - Z\|_F^2 \qquad (4)$$

leaving Z fixed.

Finally, the whole autoassociative network is considered in order to
achieve a fine tuning of the network parameters (Z, C, and D) by
minimizing the error function (3) and by taking into account that
Z = XC^T.
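Step 2 is an ordinary linear least-squares problem in C. The sketch below solves it in closed form with `numpy.linalg.lstsq`; this is a swapped-in technique chosen for brevity, since the text only states that C is retrieved by minimizing (4) and that the networks are trained by gradient descent. The function name and the synthetic data are assumptions.

```python
import numpy as np

def fit_projection_dictionary(X, Z):
    """Minimise 0.5*||X C^T - Z||_F^2 over C; returns C with shape (m, p)."""
    # lstsq solves X @ Ct ~= Z in the least-squares sense, with Ct of shape (p, m).
    Ct, *_ = np.linalg.lstsq(X, Z, rcond=None)
    return Ct.T

# Hypothetical check: codes Z generated by a known projection dictionary.
rng = np.random.default_rng(0)
X = rng.standard_normal((50, 8))
C_true = rng.standard_normal((4, 8))
Z = X @ C_true.T
C = fit_projection_dictionary(X, Z)
```

When X has full column rank, as in this example, the minimizer is unique and the known dictionary is recovered up to numerical precision.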

Both networks, ASCNN and SAANN, are trained by means of the mini-batch
stochastic gradient descent learning algorithm, which is applicable
because the penalization term, and hence the whole error function, is
differentiable.

The possibility of identifying valid alternatives to non-linear
approaches to sparse coding, in the context of non-linear
autoassociative networks with a single hidden layer, is suggested by
some classification problems which were successfully addressed on the
basis of linear network approaches. Notably, linear approaches were
successfully used to model the early stage responses of the visual
system. In the early stage, visual information is represented by a small
number of simultaneously active neurons among the much larger number of
available neurons. The first attempt to model this behaviour of the
visual system is due to Olshausen and Field. The authors built a simple
single-weight-layer feed-forward neural network (Sparsenet) where the
observed data X are a linear combination of top-bottom basis vectors D
and top-layer sparse responses U. The sparsity of the solution is
obtained by minimizing an error function composed of the standard
sum-of-squares error and a regularization term, which can be expressed
as follows:

$$E(D, U) = \frac{1}{2}\|X - UD^T\|_F^2 + \lambda \sum_{i} S\!\left(\frac{u_i}{\sigma}\right) \qquad (5)$$

where S(·) is a sparsity-inducing function, σ is a scaling constant, and
λ is the usual positive value which controls the relevance of the
regularization term.


3 Sparse Coding Neural Network: a linear approach

In order to investigate more systematically linear approaches, in the context 
of encoder-decoder architectures, we introduce here: 

• an auto-associative network with linear activation functions only; 

• an error function which allows one to learn at the same time both
projection and reconstruction dictionaries by taking explicitly into
account the contributions given by these two dictionaries; in
particular, we introduce a term in the error function which enables one
to obtain a sparse code which is as similar as possible to the output of
the encoder;

• a hard sparse coding approach, i.e., a limited number of values different
from zero, by means of a non-differentiable term in the error function.
We use hard sparse coding also in view of the fact that it is an efficient
representation of biological network behaviours.

Thus, we build a linear autoassociative neural network with two weight
layers. From now on, this network will be called Sparse Coding Neural
Network (SCNN). During the learning phase SCNN can be regarded as being
formed by two independent sub-networks (see Figure 1): a top network,
T-SCNN, which includes both the projection dictionary C and the SCNN
hidden layer, and a bottom network, B-SCNN, which includes both the
reconstruction dictionary D and the SCNN output layer. In this phase,
T-SCNN and B-SCNN are trained iteratively, one after the other. This
training process is fundamentally based on two consecutive stages. First
stage: both the B-SCNN input signals, U, and the reconstruction
dictionary, D, are learned by considering X as target values and by
imposing a specific constraint on U to obtain sparse input signals.
Second stage: the projection dictionary C is learned for T-SCNN by
considering X as input values and U as target values. Consequently, SCNN
is trained by minimizing the following global error function:

$$E(D, C, U) = \frac{1}{p}\sum_{n=1}^{N}\sum_{i=1}^{p}\Big(x_{ni} - \sum_{h=1}^{m} u_{nh} d_{ih}\Big)^2 + \frac{1}{m}\sum_{n=1}^{N}\sum_{j=1}^{m}\Big(u_{nj} - \sum_{h=1}^{p} x_{nh} c_{jh}\Big)^2 + 2\lambda\sum_{n=1}^{N}\sum_{h=1}^{m}|u_{nh}|$$

$$\text{subject to } \sum_{i=1}^{p} d_{ih}^2 \le 1, \quad h = 1, \dots, m \qquad (6)$$

where x_ni is the i-th component of the n-th input signal, u_nh is the
h-th component of the input code when the network is fed with the n-th
input signal, d_ih is the weight associated to the connection going from
the h-th hidden unit to the i-th output unit, and c_jh is the weight
associated to the connection going from the h-th input component to the
j-th hidden unit. The first term forces the bottom network to
reconstruct correctly the input signals X on the basis of both U and D,
the second term in (6) constrains the solutions of the minimization
problem to obtain B-SCNN input (u_nj) as similar as possible to T-SCNN
output (Σ_{h=1}^{p} x_nh c_jh), and the last term imposes a sparse
representation for U. Moreover, a quadratic constraint on the columns of
D is imposed to avoid U being the null matrix.

Figure 1: A graphical representation of the SCNN network. Top network
(T-SCNN) includes both the projection dictionary C and the hidden layer.
Bottom network (B-SCNN) includes both the reconstruction dictionary D
and the SCNN output layer.
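The three terms of the global error function can be written compactly in matrix form. The sketch below is a minimal illustration that assumes the weights 1/p, 1/m and 2λ for the reconstruction, code-matching and sparsity terms; these weights are an assumption inferred from the learning rates given later in this section, and the function name is hypothetical.

```python
import numpy as np

def scnn_error(X, U, D, C, lam):
    """Global SCNN error under the assumed term weights 1/p, 1/m and 2*lam.

    X: (N, p) inputs, U: (N, m) codes, D: (p, m) reconstruction dictionary,
    C: (m, p) projection dictionary.
    """
    p, m = D.shape
    reconstruction = np.sum((X - U @ D.T) ** 2) / p   # bottom network term
    code_matching = np.sum((U - X @ C.T) ** 2) / m    # top network term
    sparsity = 2.0 * lam * np.sum(np.abs(U))          # non-differentiable term
    return reconstruction + code_matching + sparsity

# Tiny hand-checkable example.
X = np.ones((2, 3))
D = np.zeros((3, 2))
C = np.zeros((2, 3))
err_zero_code = scnn_error(X, np.zeros((2, 2)), D, C, lam=0.5)
err_ones_code = scnn_error(X, np.ones((2, 2)), D, C, lam=0.5)
```

With all weights at zero, the first call leaves only the reconstruction term (6 squared residuals divided by p = 3), while the second adds a code-matching term of 4/2 and a sparsity term of 2 · 0.5 · 4.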

Let us now describe in more detail the proposed learning approach. Since
the error function (6) is separately convex in each variable, we proceed
by iteratively minimizing with respect to the B-SCNN input variables
u_nh while keeping unchanged both d_ih and c_jh (B-SCNN input update),
then minimizing with respect to the reconstruction weights d_ih while
keeping unchanged both u_nh and c_jh (B-SCNN weight update) and,
finally, minimizing with respect to the projection weights c_jh while
keeping unchanged both d_ih and u_nh (T-SCNN weight update).
Importantly, the last term in (6) is not differentiable, and
consequently it does not allow one to perform a classical gradient
descent learning algorithm.

One can overcome this difficulty using proximal methods. Since the
partial derivatives with respect to the weights d_ih and c_jh do not
involve the non-differentiable term, these derivatives can be computed
following the standard back-propagation approach:

$$\frac{\partial E}{\partial d_{ih}} = -\frac{2}{p}\sum_{n=1}^{N}\Big(x_{ni} - \sum_{k=1}^{m} u_{nk} d_{ik}\Big)u_{nh} \qquad (7)$$

where x_ni is taken as target for the bottom network, d_ih is the weight
from neuron h to neuron i of the bottom network, and u_nh is the h-th
component of the n-th B-SCNN input vector;

$$\frac{\partial E}{\partial c_{jh}} = -\frac{2}{m}\sum_{n=1}^{N}\Big(u_{nj} - \sum_{k=1}^{p} x_{nk} c_{jk}\Big)x_{nh} \qquad (8)$$

where u_nj is the j-th component of the n-th B-SCNN input vector taken
as target for the top network, c_jh is the weight from neuron h to
neuron j of the top network, and x_nh is the h-th component of the n-th
T-SCNN input vector.

Thus the update of the weights d_ih and c_jh is obtained on the basis of
a standard gradient descent as follows:

$$d_{ih}^{s} = d_{ih}^{s-1} - \eta_d \frac{\partial E}{\partial d_{ih}} \qquad (9)$$

$$c_{jh}^{s} = c_{jh}^{s-1} - \eta_c \frac{\partial E}{\partial c_{jh}} \qquad (10)$$

where s is the iteration step, and η_d and η_c are the learning rates.
Moreover, to fulfil the quadratic constraint in (6), a projection
operator onto the unit ball is applied to the columns of D.

As mentioned above, since the last term in (6) is not differentiable,
the update of U must be performed using a proximal algorithm. In
summary, a proximal algorithm minimizes a function of type
E(ξ) = E_1(ξ) + E_2(ξ), where E_1 is convex and differentiable, with
Lipschitz continuous gradient, while E_2 is lower semicontinuous, convex
and coercive. These assumptions on E_1 and E_2 are necessary to ensure
the existence of a solution. The proximal algorithm is given by
combining a proximity operator (P) with a forward gradient descent step,
as follows:

$$\xi^{s} = P\big(\xi^{s-1} - \alpha \nabla E_1(\xi^{s-1})\big) \qquad (11)$$

The step-size of the inner gradient descent is governed by the
coefficient α, which can be fixed or adaptive.

In our case (6) can be minimized with respect to U if considered as
E(U) = E_1(U) + E_2(U), where

$$E_1(U) = \frac{1}{p}\sum_{n=1}^{N}\sum_{i=1}^{p}\Big(x_{ni} - \sum_{h=1}^{m} u_{nh} d_{ih}\Big)^2 + \frac{1}{m}\sum_{n=1}^{N}\sum_{j=1}^{m}\Big(u_{nj} - \sum_{h=1}^{p} x_{nh} c_{jh}\Big)^2, \qquad E_2(U) = 2\lambda\sum_{n=1}^{N}\sum_{h=1}^{m}|u_{nh}| \qquad (12)$$

while the proximity operator corresponding to E_2 is the operator, named
soft thresholding, defined as

$$P_\lambda(u_{nk}) = \mathrm{sign}(u_{nk}) \max\{|u_{nk}| - \lambda, 0\} \qquad (13)$$

Now, substituting (12) and (13) into (11) and rearranging the terms, we
obtain the following updating rule for U:

$$u_{nk}^{s} = P_\lambda\Big(u_{nk}^{s-1} - \eta_u \frac{\partial E_1}{\partial u_{nk}}\Big) = P_\lambda\Big(u_{nk}^{s-1} + \eta_u\Big[\frac{2}{p}\sum_{i=1}^{p}\Big(x_{ni} - \sum_{h=1}^{m} u_{nh}^{s-1} d_{ih}\Big)d_{ik} - \frac{2}{m}\Big(u_{nk}^{s-1} - \sum_{h=1}^{p} x_{nh} c_{kh}\Big)\Big]\Big) \qquad (14)$$

where η_u is the learning rate.
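A single proximal step on U can be sketched in matrix form as follows. The sketch assumes that the smooth part of the error weights the reconstruction and code-matching terms by 1/p and 1/m, and it applies the soft-thresholding operator with threshold λ exactly as written in (13); a textbook proximal step would typically scale the threshold by the step size, so this is a literal reading and not a definitive implementation. All function names are hypothetical.

```python
import numpy as np

def soft_threshold(v, t):
    """Soft thresholding: sign(v) * max(|v| - t, 0), applied element-wise."""
    return np.sign(v) * np.maximum(np.abs(v) - t, 0.0)

def smooth_part(X, U, D, C):
    """Differentiable part E_1(U) under the assumed weights 1/p and 1/m."""
    p, m = D.shape
    return np.sum((X - U @ D.T) ** 2) / p + np.sum((U - X @ C.T) ** 2) / m

def update_U(X, U, D, C, lam, eta_u):
    """One forward gradient step on E_1 followed by soft thresholding."""
    p, m = D.shape
    grad = (2.0 / p) * (U @ D.T - X) @ D + (2.0 / m) * (U - X @ C.T)
    return soft_threshold(U - eta_u * grad, lam)

# Hypothetical data to exercise the step.
rng = np.random.default_rng(0)
N, p, m = 6, 8, 4
X = rng.standard_normal((N, p))
D = rng.standard_normal((p, m))
C = rng.standard_normal((m, p))
U0 = rng.standard_normal((N, m))
eta_u = 1.0 / (2.0 * np.linalg.norm(D.T @ D / p + np.eye(m) / m))
U_smooth = update_U(X, U0, D, C, lam=0.0, eta_u=eta_u)   # plain gradient step
U_sparse = update_U(X, U0, D, C, lam=1e6, eta_u=eta_u)   # everything thresholded
```

With λ = 0 the step reduces to plain gradient descent on E_1, while a very large λ drives every coefficient to zero, which shows the two roles the update plays.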

Thus, the update expressed in (14) plays a double role: it allows one to
find, on the one hand, sparse B-SCNN input signals u_nk able to
reconstruct the input data X, and, on the other hand, u_nk values which
can be "well approximated" by the T-SCNN output.

Furthermore, it is important to note that the choice of the parameters
η_u, η_d and η_c is crucial to reach a minimum of the error function (6)
as quickly as possible. For an error function of the form
E = E_1 + E_2, the learning rate parameters can be chosen on the basis
of the Lipschitz constant of ∇E_1 to speed up the iterative process.
Hence, we set the parameters η_u, η_d and η_c as follows:

$$\eta_u = \frac{1}{2\,\big\|\frac{1}{p} D^T D + \frac{1}{m} I\big\|_F} \qquad (15)$$

$$\eta_d = \frac{p}{2\,\|U U^T\|_F} \qquad (16)$$

$$\eta_c = \frac{m}{2\,\|X X^T\|_F} \qquad (17)$$

This choice of the learning rate values ensures linear rates of
convergence in the minimization of the error function [18] and
convergence of both reconstruction and projection dictionary towards a
minimizer [18]. The pseudocode of the learning process used to train
SCNN is presented in Algorithm 1.
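The three learning rates can be computed directly from the current matrices. The sketch below assumes that the norm in the learning-rate formulas is the Frobenius norm, which is also the default of `numpy.linalg.norm` for matrices. It uses U^T U and X^T X in place of U U^T and X X^T because the Frobenius norms coincide and the former are much smaller matrices when N is large. The function name is hypothetical.

```python
import numpy as np

def scnn_learning_rates(X, U, D):
    """Step sizes for the U, D and C updates, in that order."""
    p, m = D.shape
    eta_u = 1.0 / (2.0 * np.linalg.norm(D.T @ D / p + np.eye(m) / m))
    eta_d = p / (2.0 * np.linalg.norm(U.T @ U))   # same norm as U @ U.T
    eta_c = m / (2.0 * np.linalg.norm(X.T @ X))   # same norm as X @ X.T
    return eta_u, eta_d, eta_c

# Hypothetical data.
rng = np.random.default_rng(0)
N, p, m = 20, 8, 4
X = rng.standard_normal((N, p))
U = rng.standard_normal((N, m))
D = rng.standard_normal((p, m))
eta_u, eta_d, eta_c = scnn_learning_rates(X, U, D)
```

Since the Frobenius norm is never smaller than the spectral norm, these step sizes are at most the reciprocal of the corresponding Lipschitz constants, so no cross-validation over learning rates is needed.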


Algorithm 1 Learning algorithm for training the SCNN network.

INPUT: input data X.

OUTPUT: SCNN weight matrices D and C.

Initialization: initialize U^0, D^0 and C^0 with random values drawn
from a uniform distribution between −1 and 1.

Parameters: set the maximum number of external iterations T_max and
the number of hidden neurons m.

1. s ← 1

2. REPEAT

(a) B-SCNN input update. The B-SCNN inputs, U^s, at the current
step s, are computed on the basis of D^{s−1} and C^{s−1} according to
equation (14) until convergence is reached;

(b) B-SCNN weight update. The B-SCNN weights, D^s, at the current
step s, are computed on the basis of U^{s−1} according to (7) until
convergence is reached;

(c) T-SCNN weight update. The T-SCNN weights, C^s, at the current
step s, are computed on the basis of X and U^{s−1} according to (8)
until convergence is reached;

(d) E(s) ← ½‖X − X^s_rec‖_F²: the error at the current step s is
computed as Frobenius norm between the training set X and its
reconstructed version X^s_rec = X(C^s)^T(D^s)^T, obtained as the result
of the forward propagation of X through the global network SCNN;

(e) s ← s + 1;

3. UNTIL stop-condition(E, rtol, T_max) is TRUE;


4 Experiments

Experiments of two different kinds were conducted. The first series of
experiments is aimed at evaluating the capability of the networks to
reproduce PCA behaviour (see Subsection 4.1). In the second series of
experiments, we evaluated the performance of our approach on two
standard machine learning tasks. The first task is a missing-pixels
problem (see Subsection 4.2). Here, we focused on evaluating the ability
of the selected networks to obtain appropriate sparse data
representations which enable one to solve the task. The second task
concerns hand-written digit classification (see Subsection 4.3). In this
latter task we first compared SCNN with the selected networks (see
Subsection 4.3.1), and then we measured the error rate of our approach
while varying the number of hidden nodes. Our results were compared with
major extant approaches (see Subsection 4.3.2). Moreover, we evaluated
the effect of noise on the performance of SCNN. The main neural network
parameters were set up as follows:

• initialization of the weights: the initial weights were randomly chosen
in the interval [−1, 1];

• learning rates for SCNN were chosen on the basis of the Lipschitz
constants, as in (15)-(17). The maximum number of iterations was always
fixed to 1000 for step (a), 500 for step (b), 500 for step (c), and 50
for the external loop (see Algorithm 1);

• learning rates and maximum numbers of iterations for ASCNN and SAANN
were chosen in accordance with the respective references;

• a threshold was selected as stop condition. We heuristically tried
different choices for all methods, and the best results are reported;

• λ was chosen according to what is specified in each experiment.
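Algorithm 1, together with the iteration limits listed above, can be sketched as the following alternating scheme. This is an illustrative sketch under several assumptions, not the authors' implementation: the error weights 1/p and 1/m, the Frobenius norm in the learning rates, a fixed number of inner iterations in place of a convergence test, a relative-change stop condition, and the use of the most recent U in steps (b) and (c). The names `train_scnn`, `project_columns` and `rtol` are hypothetical.

```python
import numpy as np

def soft_threshold(v, t):
    """Soft thresholding: sign(v) * max(|v| - t, 0)."""
    return np.sign(v) * np.maximum(np.abs(v) - t, 0.0)

def project_columns(D):
    """Project every column of D onto the unit L2 ball."""
    return D / np.maximum(np.linalg.norm(D, axis=0), 1.0)

def train_scnn(X, m, lam, t_max=50, n_u=1000, n_d=500, n_c=500,
               rtol=1e-6, seed=0):
    """Alternating minimisation; returns D (p, m), C (m, p) and the error trace."""
    rng = np.random.default_rng(seed)
    N, p = X.shape
    U = rng.uniform(-1.0, 1.0, (N, m))
    D = project_columns(rng.uniform(-1.0, 1.0, (p, m)))
    C = rng.uniform(-1.0, 1.0, (m, p))
    eps = 1e-12                                   # avoids division by zero
    eta_c = m / (2.0 * np.linalg.norm(X.T @ X) + eps)
    errors = []
    for _ in range(t_max):
        # (a) B-SCNN input update: proximal-gradient steps on U.
        eta_u = 1.0 / (2.0 * np.linalg.norm(D.T @ D / p + np.eye(m) / m))
        for _ in range(n_u):
            grad_u = (2.0 / p) * (U @ D.T - X) @ D + (2.0 / m) * (U - X @ C.T)
            U = soft_threshold(U - eta_u * grad_u, lam)
        # (b) B-SCNN weight update: projected gradient steps on D.
        eta_d = p / (2.0 * np.linalg.norm(U.T @ U) + eps)
        for _ in range(n_d):
            grad_d = (2.0 / p) * (D @ U.T - X.T) @ U
            D = project_columns(D - eta_d * grad_d)
        # (c) T-SCNN weight update: gradient steps on C.
        for _ in range(n_c):
            grad_c = (2.0 / m) * (C @ X.T - U.T) @ X
            C = C - eta_c * grad_c
        # (d) Error of the whole network on the training set.
        errors.append(0.5 * np.sum((X - X @ C.T @ D.T) ** 2))
        # Stop when the relative change of the error falls below rtol.
        if len(errors) > 1 and abs(errors[-2] - errors[-1]) <= rtol * errors[-2]:
            break
    return D, C, errors

# Small hypothetical run with reduced iteration counts.
rng = np.random.default_rng(1)
X_demo = rng.standard_normal((30, 8))
D_hat, C_hat, errs = train_scnn(X_demo, m=5, lam=0.01,
                                t_max=3, n_u=20, n_d=20, n_c=20)
```

At test time only C is needed to encode an unseen input, as `x @ C_hat.T`, so no further minimization is required.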

4.1 Comparing SCNN with PCA

Our approach is similar to a standard linear auto-associative network
when one sets λ = 0. However, in this case our error function is
different from that used by linear auto-associative networks to
reproduce PCA results. Hence, it is not obvious that the proposed method
has solutions equal to PCA when λ is 0. Thus, the experiments in this
section are aimed at experimentally verifying whether our approach gives
rise to a behaviour which is similar to PCA as a worst case. For this
reason the networks were trained by setting the sparsity parameter
λ = 0. To evaluate the networks’ performance we built a training set by
extracting 2000 patches (small parts of the whole image) of size
d = 8 × 8 pixels from the Berkeley segmentation database of natural
images [12], which contains a high variability of scenes. We set the
number of hidden nodes equal to the number of principal components (10,
30, 50). Note that the number of hidden nodes is always less than
d = 64, i.e., the maximum number of principal components available. In
this experimental setting the lowest reconstruction error is given by
PCA.

We trained the networks and then evaluated the reconstruction error 20
times. Each time we projected the input through the projection
dictionary C and reconstructed it through the reconstruction dictionary
D. We evaluated the mean and standard deviation of the reconstruction
error; as shown in Figure 2, SCNN and ASCNN approximate PCA better than
SAANN does. It is worth noting that, as the number of principal
components increases, the reconstruction error of SCNN and ASCNN
decreases more than that of SAANN. The reconstruction error was computed
as the Root Mean Square (RMS) error between original and reconstructed
data.

Figure 2: The figure shows the values of the RMS Error versus the number
of principal components (plot title: Simulating PCA performance). Note
that the number of hidden nodes of the considered neural networks is
equal to the number of principal components. The standard deviation is
not shown because, for all methods, it is less than 0.02.
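The PCA baseline used in this comparison can be sketched as follows. The sketch assumes that the RMS error is the square root of the mean squared difference over all entries and uses random data in place of the image patches; the function names are hypothetical.

```python
import numpy as np

def pca_rms_error(X, k):
    """RMS reconstruction error when X is rebuilt from k principal components."""
    mean = X.mean(axis=0)
    Xc = X - mean
    # Rows of Vt are the principal directions, sorted by decreasing variance.
    _, _, Vt = np.linalg.svd(Xc, full_matrices=False)
    V = Vt[:k].T
    X_rec = Xc @ V @ V.T + mean
    return float(np.sqrt(np.mean((X - X_rec) ** 2)))

def autoencoder_rms_error(X, C, D):
    """RMS error of a linear encoder C (m, p) followed by a decoder D (p, m)."""
    X_rec = X @ C.T @ D.T
    return float(np.sqrt(np.mean((X - X_rec) ** 2)))

# Random stand-in for 8 x 8 patches flattened to 64 values.
rng = np.random.default_rng(0)
patches = rng.standard_normal((200, 64))
e10, e30, e50, e64 = (pca_rms_error(patches, k) for k in (10, 30, 50, 64))
```

For a given number of components k, `pca_rms_error` gives the lowest error any linear encoder-decoder pair with k hidden nodes can reach on the same data, which is why it serves as the reference curve.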
4.2 Missing Pixels

In this test we compared our approach with ASCNN, SAANN and Sparsenet
on a well-known machine learning problem called missing pixels. The aim
is to reconstruct corrupted patches (that is, patches in which a certain
amount of randomly chosen pixels is set to zero) from a test set after
having trained the networks on a training set of non-corrupted patches
[1]. We extracted 4000 patches of size p = 8 × 8 pixels from the
Berkeley segmentation dataset of natural images [19]. We then split this
dataset into a training set and a validation set of 2000 patches each.
All the patches are centered so as to have zero mean. As test set, we
used 2000 patches extracted from the test images of the Berkeley
dataset, which were never seen by the networks before. All the patches
of both validation and test set were corrupted by setting to zero the
same percentage of pixels. In particular, we chose five noise levels
corresponding to 10%, 20%, 30%, 40% and 50% of missing pixels. The four
different approaches (ASCNN, SAANN, Sparsenet and SCNN) were applied on
the training set using 40 equispaced values of λ in the range [0.01, 20]
for Sparsenet, and in the range [0.01, 1] for the remaining networks.
For each network the best solution was chosen on the validation set by
considering the minimum reconstruction error. Finally, the performance
of the networks was computed on the test set. Note that for each
approach the reconstructed images for the test set were obtained using
the projection dictionary chosen during the validation phase, except for
Sparsenet, where a learning process is again required. As shown in
Figure 3, SCNN is able to reconstruct a test image better than ASCNN and
SAANN for all the noise levels, whereas its performance is comparable
with that of Sparsenet.
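The corruption procedure can be sketched as follows. The sketch assumes that every patch loses the same number of pixels, chosen uniformly at random without replacement; the function name and the toy data are hypothetical.

```python
import numpy as np

def corrupt_patches(patches, fraction, rng):
    """Return a copy of the patches with a given fraction of pixels set to zero.

    patches: (n, d) float array, one flattened patch per row.
    """
    n, d = patches.shape
    n_missing = int(round(fraction * d))
    corrupted = patches.copy()
    for row in corrupted:                      # each row is a view into the copy
        idx = rng.choice(d, size=n_missing, replace=False)
        row[idx] = 0.0
    return corrupted

# Toy example: ten 8 x 8 patches of ones, 25% of the pixels removed.
rng = np.random.default_rng(0)
clean = np.ones((10, 64))
noisy = corrupt_patches(clean, 0.25, rng)
```

Applying the function with fractions 0.1 to 0.5 to the validation and test patches reproduces the five noise levels, while the training patches stay uncorrupted.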

Moreover, we computed the sparsity values of the four networks on the
test set. We selected the solutions corresponding to the parameter λ
which gave the best performance on the validation set with noise equal
to 10%, 20%, 30%, 40% and 50%.

Note that the sparsity value was defined as
$1 - \frac{1}{mN}\sum_{i,j}\theta(|u_{ij}| - Th_s)$, where u_ij are the
coefficients of the j-th signal of the test set, θ(x) is the Heaviside
function, and Ths is a threshold value ranging in [0, 0.5]. This
approach to computing sparsity values was motivated by the fact that
many coefficients might be near to zero, but not exactly zero.
Consequently, this choice enables one to achieve a better evaluation of
the networks’ ability to produce sparse data representations.

Figure 4 shows these sparsity values against Ths values. Note that even
for very low Ths values SCNN reaches high values of sparsity, and this
ability is preserved across the different noise levels. The performance
of SCNN is equal to or better than that of the other selected methods.

Figure 3: Reconstruction error versus noise. Noise values correspond to
10%, 20%, 30%, 40% and 50% of missing pixels.

Figure 4: Plot of the sparsity achieved on a test image versus the
threshold Ths for different noise values. The best performance is
obtained by SCNN, which turns out to be the most stable method with
respect to increasing percentages of noise.
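The thresholded sparsity measure can be sketched in a few lines. The sketch assumes that the threshold is applied to coefficient magnitudes and that the Heaviside function is taken as zero at the origin, so that coefficients exactly at the threshold count as inactive; both conventions and the function name are assumptions.

```python
import numpy as np

def sparsity_value(U, threshold):
    """One minus the fraction of coefficients whose magnitude exceeds the threshold."""
    active = np.abs(U) > threshold
    return 1.0 - float(np.mean(active))

# Toy code matrix with one exact zero and one near-zero coefficient.
U_demo = np.array([[0.0, 0.01, 0.3, -0.6]])
s_0 = sparsity_value(U_demo, 0.0)
s_005 = sparsity_value(U_demo, 0.05)
s_05 = sparsity_value(U_demo, 0.5)
```

Raising the threshold from 0 to 0.05 already counts the near-zero coefficient 0.01 as inactive, which is the behaviour the thresholded definition is meant to capture.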
4.3 Hand-Written Digit Classification

In this test we considered a hand-written digit classification problem.
The MNIST dataset [20] is used here. We extracted three datasets: a
training set Digit_Tr, a validation set Digit_Val and a test set
Digit_T. Each image belonging to these datasets was transformed by
re-mapping the pixel value range into the interval [−1, 1]. Each network
was applied on the training set using different values of λ. For each λ
value a linear SVM was used as multi-class classifier [21]. The
classifier was trained with the sparse image representations
corresponding to the training set Digit_Tr, and then it was fed with the
sparse image representations corresponding to the validation set
Digit_Val. For each network the best solution was chosen on the
validation set considering the classification accuracy (defined as the
ratio between correctly classified images and the total number of images
belonging to the set). Finally, the performance of the networks was
computed on the test set Digit_T considering the performance of the
multi-class classifier when it is fed with the sparse image
representations obtained by the previously selected projection
dictionaries. The test was organized into two phases. In the first
phase, we compared our approach with ASCNN, SAANN and Sparsenet on a
reduced subset of the MNIST dataset (see Section 4.3.1). In the second
phase, we measured the performance of our approach on the whole MNIST
dataset, in order to make a comparison possible with other methods
presented in the literature (see Section 4.3.2).
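The pre-processing and model-selection steps of this protocol can be sketched as follows. The sketch assumes 8-bit pixel values in [0, 255] and represents the validation results as a dictionary from λ to accuracy; the classifier itself is left out, and all names and numbers are hypothetical.

```python
import numpy as np

def remap_pixels(images, low=0.0, high=255.0):
    """Linearly re-map pixel values from [low, high] to [-1, 1]."""
    images = np.asarray(images, dtype=float)
    return 2.0 * (images - low) / (high - low) - 1.0

def accuracy(y_true, y_pred):
    """Ratio between correctly classified items and the total number of items."""
    return float(np.mean(np.asarray(y_true) == np.asarray(y_pred)))

def select_best_lambda(val_accuracy_by_lambda):
    """Return the lambda whose validation accuracy is highest."""
    return max(val_accuracy_by_lambda, key=val_accuracy_by_lambda.get)

# Made-up values, for illustration only.
remapped = remap_pixels([0, 127.5, 255])
acc = accuracy([1, 2, 3, 4], [1, 2, 0, 4])
best_lam = select_best_lambda({0.01: 0.80, 0.5: 0.85, 1.0: 0.70})
```

Only the dictionary selected on the validation set is then used to encode the test set, so the test accuracy is not involved in choosing λ.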

4.3.1 Comparing SCNN with ASCNN, SAANN and Sparsenet 

Here SCNN is compared with ASCNN, SAANN and Sparsenet. The test is
organized in three parts. In the first part, for a fixed size of the
input, we computed classification accuracies on the basis of the data
representations of the four networks while varying the number of hidden
nodes, i.e., the number of atoms of the reconstruction dictionary. Thus,
we evaluated how the inner complexity of each neural network is
reflected in its performance. Moreover, we measured the computational
times of each network during both the learning and the validation phase
to compare the computational costs of our approach with those of the
other methods. In the second part of the experiment, for each approach
we chose the neural architecture with a number of hidden nodes producing
a “satisfactory” classification accuracy, and evaluated whether the
ability to obtain data representations with high sparsity values was
preserved in this case. In particular, in order to get a better insight
into the relation between accuracies and sparsity values, we first
computed the area under the curve obtained by plotting the sparsity
values versus the threshold Ths, called sparsity area, and then we
computed the accuracies versus the sparsity areas. In the last part of
the experiment, we evaluated to what extent each approach suffers from
the “curse of dimensionality” [22]. Accordingly, having fixed the number
of hidden nodes, we computed the classification accuracies of the four
networks for varying sizes of the input.
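The sparsity area can be sketched as a trapezoidal integral of the sparsity curve over the threshold range. The sketch assumes thresholds in [0, 0.5], a threshold applied to coefficient magnitudes, and the trapezoidal rule as integration scheme; none of these details is stated explicitly in the text, and the function names are hypothetical.

```python
import numpy as np

def sparsity_value(U, threshold):
    """One minus the fraction of coefficients whose magnitude exceeds the threshold."""
    return 1.0 - float(np.mean(np.abs(U) > threshold))

def sparsity_area(U, thresholds):
    """Area under the sparsity-versus-threshold curve (trapezoidal rule)."""
    thresholds = np.asarray(thresholds, dtype=float)
    values = np.array([sparsity_value(U, t) for t in thresholds])
    widths = np.diff(thresholds)
    return float(np.sum(0.5 * (values[1:] + values[:-1]) * widths))

# Two extreme toy codes: all zeros (fully sparse) and all ones (fully dense).
grid = np.linspace(0.0, 0.5, 11)
area_sparse = sparsity_area(np.zeros((4, 6)), grid)
area_dense = sparsity_area(np.ones((4, 6)), grid)
```

A fully sparse code reaches the maximum area, equal to the width of the threshold range, while a code whose coefficients all exceed the largest threshold has area zero; plotting accuracy against this single number summarizes the accuracy-sparsity trade-off.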

Each of the three datasets Digit_tr, Digit_val and Digit_ts contained 500 
digits. We chose 40 equispaced values of λ in the range [0.01, 20] for 
Sparsenet, and in the range [0.01, 1] for the other networks. In the first 
part of the test (Figure 5), the dictionary dimension was varied from 25 to 
150 with a step of 25, and the digits were reshaped into a matrix of 
dimensions 14 x 14. Notably, both Sparsenet and SCNN reach high 
classification accuracy using a small number of atoms, and their accuracy 
seems to be only weakly dependent on this number, whereas ASCNN and SAANN 
need more than 100 hidden nodes to reach performances comparable to those 
of Sparsenet and SCNN. Figures 6 and 7 show the means and the standard 
deviations of the computational times over the values of the λ parameter, 
for each dictionary dimension and for all methods, during the learning phase 
and the validation phase, respectively. Figure 6 shows that SCNN is 
uniformly faster than the other methods during the learning phase for all 
dictionary dimensions. In Figure 7, the Sparsenet computational times are 
not shown because they are an order of magnitude greater than those of the 
other methods. In this case too, SCNN is faster than the other approaches 
for each dictionary dimension. 
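
For concreteness, the parameter grids of this first part can be written 
down as follows. The grid end points and sizes are those stated above. The 
resizing rule (averaging 2 x 2 pixel blocks to go from 28 x 28 to 14 x 14) 
and all variable names are assumptions made for this sketch, since the text 
does not state how the digits were reshaped. 

```python
import numpy as np

# 40 equispaced values of the sparsity parameter per method.
lambdas_sparsenet = np.linspace(0.01, 20.0, 40)
lambdas_others = np.linspace(0.01, 1.0, 40)

# Dictionary dimensions: 25 to 150 with a step of 25.
dictionary_sizes = np.arange(25, 151, 25)

def downsample_2x2(image):
    """Halve each side of an image by averaging 2 x 2 blocks.

    Assumed resizing rule; requires even height and width.
    """
    h, w = image.shape
    return image.reshape(h // 2, 2, w // 2, 2).mean(axis=(1, 3))

if __name__ == "__main__":
    digit = np.arange(28 * 28, dtype=float).reshape(28, 28)  # placeholder image
    print(downsample_2x2(digit).shape, dictionary_sizes)
```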

In the second part of the test, we fixed the size of the dictionary to 100 
and computed classification accuracies and sparsity areas using 100 
equispaced values of the sparsity parameter λ. Figure 8 shows classification 
accuracy versus sparsity area for each network. The best accuracy values 
are obtained by SCNN and Sparsenet. In particular, Sparsenet reaches the 
maximum value (0.88) among the selected approaches. SCNN obtains high 
accuracy values (more than 0.80) while preserving high sparsity values in 
terms of sparsity area. The other algorithms (ASCNN and SAANN) exhibit high 
sparsity values, but only in connection with lower classification accuracy. 
More specifically, for high sparsity values both SAANN and ASCNN reach 
accuracy values lower than 0.8. On the whole, SCNN enables one to reach 
both high accuracy and high sparsity. 

In the last part of the test (Figure 9), we fixed both the size of the 
dictionary (100) and the size of the training set, while the digits were 
reshaped into matrices of dimensions from 14 x 14 to 28 x 28. Figure 9 shows 
that the accuracy values reached by SCNN and Sparsenet do not seem to be 
affected by input size. By contrast, ASCNN and SAANN turn out to be 
strongly dependent on input size. 


4.3.2 SCNN Error Rate and Noise tolerance 

In this phase the three datasets Digit_tr, Digit_val and Digit_ts contained 
50000, 10000 and 10000 digits, respectively. The test was organized in two 
parts. In the first part, we measured the error rate of the linear 
multi-class classifier on the sparse digit representations obtained by our 
approach for varying numbers of hidden nodes. The experimental setting was 
left basically unchanged with respect to the one described previously. In 
particular, we used 10 equispaced values of λ in the range [0.02, 0.2], and 
a number of hidden nodes equal to 400, 800, 1200 and 1600. The sparse 
representations of Digit_tr, Digit_val and Digit_ts were obtained by 
applying the projection dictionary followed by a soft-thresholding 
operation. 
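
The encoding step just described (a linear projection followed by 
soft-thresholding) can be sketched as follows. The array layout, the 
function names, and the use of the sparsity parameter as the threshold are 
assumptions made for illustration. The standard soft-thresholding operator 
sign(z) max(|z| - t, 0) is used here, and the operator employed in the 
experiments may differ in detail. 

```python
import numpy as np

def soft_threshold(z, t):
    """Standard soft-thresholding: shrink magnitudes by t, zeroing small entries."""
    return np.sign(z) * np.maximum(np.abs(z) - t, 0.0)

def encode(X, projection, t):
    """Sparse code of each row of X.

    X          : (n_samples, n_inputs) data matrix.
    projection : (n_hidden, n_inputs) projection (encoder) dictionary,
                 assumed to be already learned.
    t          : threshold (assumed here to play the role of the sparsity
                 parameter).
    """
    return soft_threshold(X @ projection.T, t)

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    X = rng.random((5, 196))                         # five hypothetical 14 x 14 digits
    projection = rng.normal(size=(400, 196)) / 14.0  # placeholder dictionary
    codes = encode(X, projection, t=0.1)
    print(codes.shape, float(np.mean(codes == 0.0)))
```

No learning step is needed at test time in this scheme: unseen data are 
encoded with one matrix product and one element-wise operation. 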

In the second part of the test, we evaluated the effect of noise on the 
performance of SCNN. In particular, we compared SCNN, raw data, and SCNN 
with a re-learning phase on the test set (without using a projection 
dictionary). We obtained 5 versions of the MNIST dataset by adding Gaussian 
white noise of mean 0 and standard deviation σ = 0.02, 0.04, 0.06, 0.08. On 
each dataset we repeated the previously described experimental setting with 
a number of hidden nodes equal to 400. 
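
The noisy datasets can be generated along the following lines. This is a 
sketch only: it assumes that pixel intensities are scaled to [0, 1] and 
that no clipping is applied after adding the noise, neither of which is 
stated in the text. The function name and the seed are hypothetical. 

```python
import numpy as np

def add_gaussian_noise(images, sigma, seed=0):
    """Add zero-mean Gaussian white noise with standard deviation sigma."""
    rng = np.random.default_rng(seed)
    return images + rng.normal(loc=0.0, scale=sigma, size=images.shape)

# Noise levels listed in the text.
noise_levels = [0.02, 0.04, 0.06, 0.08]

if __name__ == "__main__":
    images = np.zeros((200, 784))  # placeholder for MNIST digits in [0, 1]
    noisy_versions = {s: add_gaussian_noise(images, s) for s in noise_levels}
    for s, noisy in noisy_versions.items():
        print(s, round(float(noisy.std()), 3))
```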

The results show that our approach performs quite well on this non-linear 
classification task, insofar as its results are consistently better than 
those obtained by linear classifiers on raw images across the various 
projection dictionary sizes. In addition, the best error rate was 2.0%, 
which is better than or comparable to state-of-the-art results based on 
unsupervised feature learning plus linear classification without 
additional image geometric information (see Tables 1 and 2). In particular, 
the error rate of the deep belief network is very similar to that obtained 
by SCNN. Moreover, our approach seems to be little affected by noise (see 
Table 3). 
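
The noise tolerance can be made explicit with a short calculation on the 
error rates reported in Table 3 (400 hidden units). The numbers below are 
copied from that table, while the variable and function names are 
hypothetical. 

```python
# Error rates (%) copied from Table 3, one entry per noise level.
noise_sigma = [0.02, 0.04, 0.06, 0.08]
raw_svm = [6.3, 6.8, 7.7, 8.7]
scnn_svm = [3.8, 4.2, 4.7, 5.5]
scnn_relearn_svm = [4.9, 4.8, 5.7, 6.9]

def gaps(baseline, method):
    """Error-rate difference (percentage points) baseline minus method."""
    return [round(b - m, 1) for b, m in zip(baseline, method)]

gain_over_raw = gaps(raw_svm, scnn_svm)
gain_over_relearning = gaps(scnn_relearn_svm, scnn_svm)

if __name__ == "__main__":
    for s, g_raw, g_rel in zip(noise_sigma, gain_over_raw, gain_over_relearning):
        print(f"sigma={s}: {g_raw} points over raw, {g_rel} over re-learning")
```

The advantage of SCNN over raw images grows from 2.5 to 3.2 percentage 
points as σ increases from 0.02 to 0.08, and SCNN with the projection 
dictionary stays between 0.6 and 1.4 points below SCNN with re-learning. 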


5 Conclusions 

In this paper we showed that a linear two-layer neural network (SCNN) can 
be profitably used to achieve sparse data representations for solving 
machine learning problems. 

In our approach, hidden layer linearity and the specific choice of error 
function allow one to explicitly define learning rate values which ensure 
linear rates of convergence in the minimization of the error function and 
convergence of 

Figure 5: Classification accuracy versus number of components. SCNN 
achieves high accuracy even with 20 hidden nodes, while ASCNN and SAANN 
achieve the same performances as SCNN only when more than 100 hidden 
nodes are allowed. 

Figure 6: Computational times during the learning phase for each neural 
network with respect to the dictionary dimension. The bars show the means 
of the computational times over the values of the λ parameter. The error 
bars represent the standard deviations. 

Figure 7: Computational times during the validation phase for each neural 
network with respect to the dictionary dimension. Sparsenet computational 
times are not shown because they are an order of magnitude greater than 
those of the other neural networks. The bars show the means of the 
computational times over the values of the λ parameter. The error bars 
represent the standard deviations. 

Figure 8: Classification accuracy values versus sparsity areas. SCNN seems 
to achieve the best compromise between sparsity area and accuracy in 
correspondence to sparsity area ~ 0.8 and accuracy ~ 0.8. 

Figure 9: Classification accuracy versus relative size. When the number of 
hidden nodes and the size of the training set are left fixed while the 
number of inputs is increased by reshaping the digits, SCNN is very robust 
with respect to the “curse of dimensionality”. 


Methods                                     Error rate (%) 
Raw image + linear SVM                      12 [23] 
Sparse coding + linear SVM                  2.02 [21] 
Deep Belief Network + linear SVM            1.90 [22] 
Stacked RBM network                         1.2 [22] 
Map transformation cascade + linear SVM     1.90 [21] 
Local Kernel smoothing                      3.48 [21] 
VQ coding + linear SVM                      3.98 [21] 
Laplacian eigenmap + linear SVM             2.73 [21] 
Large Conv. Net, unsup. pretraining         0.53 [21] 
Local coordinate coding + linear SVM        1.90 [21] 
Human                                       0.2 [21] 

Table 1: Error rates (%) of MNIST classification with different methods. 


# hidden units        400    800    1200   1600 
SCNN + linear SVM     2.7    2.4    2.1    2.0 

Table 2: Error rates (%) of MNIST classification by SCNN + linear SVM 
for different numbers of hidden units. 


Noise (σ)                               0.02   0.04   0.06   0.08 
Raw image + linear SVM                  6.3    6.8    7.7    8.7 
SCNN + linear SVM                       3.8    4.2    4.7    5.5 
SCNN with re-learning + linear SVM      4.9    4.8    5.7    6.9 

Table 3: Error rates (%) of MNIST classification by SCNN + linear SVM with 
a number of hidden units equal to 400 for different noise values. These 
values are also compared with those obtained by raw images and by SCNN with 
a re-learning phase on the test set (without using a projection dictionary). 
both reconstruction (decoder) and projection (encoder) dictionaries towards 
a minimizer (see [15], [25]). Moreover, we have to set the number of units 
of just one hidden layer, unlike deep network approaches, and the linearity 
of the hidden layer allows one to perform a learning process which is 
computationally less expensive than a non-linear approach by a constant 
factor. In this sense, we claim that our architecture is a comparatively 
"simple" one. 

SCNN reaches a reconstruction error very similar to that of PCA, as 
described in Section 4.1. Interestingly, by means of our approach we 
obtained performances that are comparable to or better than those of 
non-linear methods. More specifically, the experiments in Sections 4.1 and 
4.3.1 show that our approach produces sparse data representations enabling 
one to solve standard machine learning problems. SCNN outperforms ASCNN and 
SAANN in all cases, and its performances are comparable to those of 
Sparsenet, which does not use an encoder to project unseen data to a sparse 
code, but requires the iteration of a learning process. In the hand-written 
digit classification problem (see Section 4.3.2), our results are 
competitive with state-of-the-art results that are based on unsupervised 
feature learning plus linear classification without using additional image 
geometric information. In addition, our approach seems to be little 
affected by noise. 

It is worth noting, in connection with Section 4.2, that during the test 
phase our approach (SCNN) reaches sparsity values comparable with or 
better than those obtained by ASCNN and Sparsenet, which make use of 
non-linear activation functions and a new learning process, respectively. 
SAANN has sparsity values that are much lower than those obtained by all 
other methods. Figure 4 shows that the SCNN linear encoder projects input 
data with a degree of sparsity greater than or equal to that obtained by 
ASCNN and Sparsenet for different noise levels. In connection with 
Section 4.3.1, it is worth noting that both SCNN and Sparsenet achieve high 
accuracy values (more than 0.80) while preserving high sparsity values in 
terms of sparsity area. ASCNN and SAANN also exhibit high sparsity values, 
but in correspondence with lower values of classification accuracy. In the 
hand-written digit classification problem (see Section 4.3.2), a linear 
projection dictionary followed by a soft-thresholding operator enables one 
to obtain an error rate of about 2%, that is, a better value than that 
obtained by other approaches such as Local Kernel smoothing, VQ coding and 
Laplacian eigenmap. Furthermore, this value is comparable to those of 
standard sparse coding approaches and deep networks (see Tables 1 and 2). 

It is worth noting that the learning process based on an alternate updating 
of the dictionaries C and D (see the ASCNN and SCNN learning processes, 
Sections 2 and 3) appears to lead to better results than the update 
involving the two dictionaries simultaneously (see the SAANN learning 
process, Section 2). In fact, SAANN achieves in many cases worse results 
than SCNN and ASCNN (see for example Figures 2, 3, and 6). 

Altogether, these results suggest that linear encoders can be used to 
obtain sparse data representations that are useful in the context of 
machine learning problems, provided that an appropriate error function is 
used during the learning phase. In particular, on the basis of some 
suggestions made in [21], [2], we introduced a term in the error function 
which enables one to obtain a sparse code which is as similar as possible 
to the output of the encoder. 